Adding promotion for UnknownType per V3+ spec by ldsantos0911 · Pull Request #2155 · apache/iceberg-python

ldsantos0911 · 2025-06-26T21:36:31Z

Rationale for this change

When attempting to write a PyArrow dataframe that has a null field corresponding to a nullable Table field, I see this error:

ValueError: Mismatch in fields:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field              ┃ Dataframe field           ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ❌ │ 1: col1: optional string │ 1: col1: optional unknown │
│ ✅ │ 2: col2: optional string │ 2: col2: optional string  │
└────┴──────────────────────────┴───────────────────────────┘

Per the V3 spec, UnknownType (as pa.null() is cast) should be promotable to any type. This change attempts to unblock situations where a DataFrame may end up having a null type for an optional field while also incorporating the V3 spec.

Note: This issue is related but was based on an older version. Nonetheless, the underlying situation is still blocked.

Are these changes tested?

Yes. Added a couple of relevant unit tests. Additionally, I did a manual sanity check:

Python 3.10.18 (main, Jun  4 2025, 08:06:20) [Clang 16.0.0 (clang-1600.0.26.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> from pyiceberg.catalog import load_catalog
>>> from pyiceberg.schema import Schema
>>> from pyiceberg.types import StringType, NestedField
>>> warehouse_path = '/tmp/warehouse'
>>> catalog = load_catalog('default', **{'type': 'sql', 'uri': f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", 'warehouse': f"file://{warehouse_path}"})
>>> table = catalog.load_table('default.test2')
>>> table.schema()
Schema(NestedField(field_id=1, name='col1', field_type=StringType(), required=False), NestedField(field_id=2, name='col2', field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])
>>> df = pa.Table.from_pylist([{'col1': None, 'col2': 'blah'}])
>>> df.schema
col1: null
col2: string
>>> table.append(df)
>>> x = table.scan().to_arrow()
>>> x
pyarrow.Table
col1: large_string
col2: string
----
col1: [[null]]
col2: [["blah"]]

Are there any user-facing changes?

Yes, this will allow UnknownType promotion.

gabeiglio · 2025-07-16T17:24:41Z

Thanks for working on this! The changes LGTM. Tagging @Fokko for another pair of eyes :)

kevinjqliu

I think there are 2 different issues here.

For V3 tables, yes, I think we can map pyarrow's null type to UnknownType
For V1/V2 tables, I think its correct that visit_pyarrow should throw for pyarrow null type. There's not an equivalent iceberg type mapping for pyarrow's null type.

The example in #2119, the workaround is to make sure that the pyarrow table does not contain null type. you can do that by explicitly passing the schema when creating the pyarrow table

import pyarrow as pa

# table created with the below pyarrow schema
schema = pa.schema(
    [
        pa.field("col1", pa.string(), nullable=True),
    ]
)

df = pa.Table.from_pylist(
    [
        {"col1": None}
    ],
    schema=schema
)

df.schema

ldsantos0911 · 2025-07-21T21:46:20Z

I think there are 2 different issues here.

For V3 tables, yes, I think we can map pyarrow's null type to UnknownType For V1/V2 tables, I think its correct that visit_pyarrow should throw for pyarrow null type. There's not an equivalent iceberg type mapping for pyarrow's null type.

The example in #2119, the workaround is to make sure that the pyarrow table does not contain null type. you can do that by explicitly passing the schema when creating the pyarrow table
import pyarrow as pa

# table created with the below pyarrow schema
schema = pa.schema(
    [
        pa.field("col1", pa.string(), nullable=True),
    ]
)

df = pa.Table.from_pylist(
    [
        {"col1": None}
    ],
    schema=schema
)

df.schema

Thank you for that additional context regarding V1/V2; I believe that makes sense. As far as V3 tables go, it looks like null is already being cast to UnknownType. However, as far as I can tell, the missing link is supporting type promotion to make it so UnknownType is evaluated as compatible with XType. Otherwise, as noted above, we see ValueError: Mismatch in fields.

Fokko · 2025-08-12T15:10:27Z

pyiceberg/schema.py


+@promote.register(UnknownType)
+def _(file_type: UnknownType, read_type: IcebergType) -> IcebergType:
+    return read_type  # Per V3 Spec, "Unknown" can be promoted to any type


If I read the spec carefully, I think it only allows for promoting to primitive types:

So, it cannot be promoted to a {List,Map,Struct}Type.

I interpreted this as promotions available for a primitive type rather than promotions to a primitive type. i.e. UnknownType is a primitive type and these are the types it can be promoted to. It should be easy enough to change, so I'm fine either way. But, to me it would make sense that if a {List,Map,Struct}Type field is nullable and a null PyArrow field is being evaluated for compatibility, that evaluation should succeed. What are your thoughts?

I think we need to extend this logic a bit:

Suggested change

return read_type # Per V3 Spec, "Unknown" can be promoted to any type

# Per V3 Spec, "Unknown" can be promoted to any Primitive type

if isinstance(read_type, PrimitiveType):

return file_type

else:

raise ResolveError(f"Cannot promote {file_type} to {read_type}")

Fokko · 2025-08-12T15:12:48Z

tests/io/test_pyarrow_visitor.py

+
+def test_schema_conversion_null_column(table_schema_simple: Schema) -> None:
+    pyarrow_schema: pa.Schema = schema_to_pyarrow(table_schema_simple)
+    new_field = pyarrow_schema.field(2).with_type(pa.null())  # Make the optional boolean field null for testing


How about creating a new schema with just a field with a NestedType instead? I don't think we want to re-assign fields as that makes the test harder te read.

Sure I can make a new schema. I think I was just trying to get it working and forgot to clean this up 😀

Fokko · 2025-08-12T15:13:48Z

@ldsantos0911 Thanks for working on this, left a few comments, but this PR makes a lot of sense to me 👍

Adding promotion for UnknownType per V3+ spec

35cf34a

ldsantos0911 marked this pull request as ready for review June 26, 2025 21:56

Fokko self-requested a review June 27, 2025 06:19

fix linting error

531e526

kevinjqliu reviewed Jul 19, 2025

View reviewed changes

kevinjqliu mentioned this pull request Jul 19, 2025

When writing data from a PyArrow DataFrame, how should we handle 'null' Fields? #2119

Closed

Fokko reviewed Aug 12, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding promotion for UnknownType per V3+ spec#2155

Adding promotion for UnknownType per V3+ spec#2155
ldsantos0911 wants to merge 2 commits intoapache:mainfrom
ldsantos0911:support_unknown_promotion

ldsantos0911 commented Jun 26, 2025

Uh oh!

gabeiglio commented Jul 16, 2025

Uh oh!

kevinjqliu left a comment

Uh oh!

ldsantos0911 commented Jul 21, 2025

Uh oh!

Fokko Aug 12, 2025 •

edited

Loading

Uh oh!

ldsantos0911 Aug 12, 2025

Uh oh!

Fokko Aug 18, 2025

Uh oh!

Fokko Aug 12, 2025

Uh oh!

ldsantos0911 Aug 12, 2025

Uh oh!

Fokko commented Aug 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

-    return read_type  # Per V3 Spec, "Unknown" can be promoted to any type
+    # Per V3 Spec, "Unknown" can be promoted to any Primitive type
+    if isinstance(read_type, PrimitiveType):
+        return file_type
+    else:
+        raise ResolveError(f"Cannot promote {file_type} to {read_type}")

Conversation

ldsantos0911 commented Jun 26, 2025

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Uh oh!

gabeiglio commented Jul 16, 2025

Uh oh!

kevinjqliu left a comment

Choose a reason for hiding this comment

Uh oh!

ldsantos0911 commented Jul 21, 2025

Uh oh!

Fokko Aug 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ldsantos0911 Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

Fokko Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

Fokko Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

ldsantos0911 Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

Fokko commented Aug 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fokko Aug 12, 2025 •

edited

Loading